The goal of this project was to find the best-performing supervised machine learning model, with optimized hyperparameters, for efficiently predicting the class of a rice grain from its extracted characteristics.
The data consisted of images of Arborio, Basmati, Ipsala, Jasmine and Karacadag rice grains against a black background, 500 samples in total. The project follows the findings of a previous study by Çınar and Koklu entitled Identification of Rice Varieties Using Machine Learning Algorithms, and the samples used were randomly picked from a collection of 75,000 images using a fixed random seed (50).
For preprocessing, the data was standardized and dimensionality reduction (PCA) was applied for proper visualization of the clusters. Missing values were handled accordingly, and morphological, shape and color features were extracted from each grain. Random Forest, Support Vector Machine and MLP were chosen as the model candidates, and the final evaluation was performed as a nested cross-validation, which revealed that the model with the best accuracy is the Support Vector Machine.
Original research article:
İ. Çınar and M. Koklu. Identification of rice varieties using machine learning algorithms. Journal of Agricultural Sciences, 28(2):307–325, 2022. doi: 10.15832/ankutbd.862482.
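The abstract above mentions nested cross-validation: hyperparameters are tuned on inner folds carved out of each outer training set, and each tuned model is scored exactly once on the held-out outer fold. As an illustration of the idea only (the project itself uses scikit-learn utilities, imported below), the index bookkeeping can be sketched in plain Python:

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def nested_cv_splits(n, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_splits) triples.

    Hyperparameters are selected using only the inner splits of outer_train,
    so the outer_test fold never influences model selection."""
    for test_fold in k_fold_indices(n, outer_k):
        train = [i for i in range(n) if i not in set(test_fold)]
        inner = []
        for inner_test in k_fold_indices(len(train), inner_k):
            inner_test_idx = [train[j] for j in inner_test]
            inner_train_idx = [i for i in train if i not in set(inner_test_idx)]
            inner.append((inner_train_idx, inner_test_idx))
        yield train, test_fold, inner

splits = list(nested_cv_splits(10, outer_k=5, inner_k=3))
print(len(splits))  # 5 outer folds
```

Each outer fold thus gets its own independently tuned model, which is what makes the outer accuracy estimate unbiased by the tuning process.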
# Importing necessary libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
# Importing libraries for file manipulation and image processing
import glob, os
import cv2 as cv
# Importing libraries for statistical analysis
from scipy.stats import kurtosis, skew
# Importing libraries for machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn import svm
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
# Importing libraries for miscellaneous functionalities
from random import sample, seed
from IPython.display import display, HTML
import warnings
import time
The rice sample images were downloaded from http://www.muratkoklu.com/datasets/vtdhnd09.php. They are separated into folders by rice species: 'Arborio', 'Basmati', 'Ipsala', 'Jasmine', and 'Karacadag'. From each species, 100 images were sampled at random, totaling 500 images. The call seed(50) fixes the random state for reproducibility.
# import data
# set the seed for enabling the reproduction with the same sequence
seed(50)
path = '../data'
folders = ['Arborio', 'Basmati', 'Ipsala', 'Jasmine', 'Karacadag']
all_images = []
subset = []
for folder in folders:
    path_folder = os.path.join(path, folder)
    # all .jpg files from the given folder
    files = glob.glob(os.path.join(path_folder, '*.jpg'))
    # gather all sampled filenames in subset list
    subset.extend(sample(files, 100))
# Gathering the sampled images into a list
image_list = []
# Iterating through the subset list and reading the images
for image in subset:
    image_list.append(cv.imread(image))
The image basmati (244).jpg is used as a test image.
# Loading the test image
test_image = cv.imread(os.path.join(path, 'Basmati', 'basmati (244).jpg'))
The contours of each rice grain were determined using the findContours function from OpenCV. The contour of the test image was also identified. To visualize the results, both the original test image and a copy with its contour drawn on it were plotted.
To avoid modifying the original image when using drawContours, a copy of the test image was passed to the function.
# Function for finding contours
def find_contours(image):
    # Grayscale conversion
    img_gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
    # Binary thresholding (0 is black, 255 is white; values greater than 127 become white)
    ret, thresh = cv.threshold(img_gray, 127, 255, 0)
    # Identify contours. RETR_TREE retrieves all contours in a tree hierarchy (suitable for
    # grains against a black background); CHAIN_APPROX_SIMPLE compresses straight segments,
    # storing only their end points
    contours, hierarchy = cv.findContours(image=thresh, mode=cv.RETR_TREE, method=cv.CHAIN_APPROX_SIMPLE)
    return contours
# Copy of the test image to avoid overwrite
test_copy = test_image.copy()
# Drawing of the contours to the copy, ContourIdx=-1 to draw all contour lines and LINE_AA for an anti-aliased line
cv.drawContours(image=test_copy, contours=find_contours(test_copy), contourIdx=-1, color=(0,255,0), thickness=1, lineType=cv.LINE_AA)
# Display the original image and the contours
plt.subplot(1, 2, 1)
plt.imshow(cv.cvtColor(test_image, cv.COLOR_BGR2RGB))
plt.title('Original Image')
plt.axis('off')
# Contours Image
plt.subplot(1, 2, 2)
plt.imshow(cv.cvtColor(test_copy, cv.COLOR_BGR2RGB))
plt.title('Contours')
plt.axis('off')
# List containing contour values for each image
contours_list = []
for image in image_list:
    contours_list.append(find_contours(image))
In the first part, the images are loaded and gathered into a list (image_list).
The find_contours function is used to gather all contours into contours_list.
The function is also applied to the test image (basmati (244)), and a copy is used to display the original image next to its contours.
Main source: https://learnopencv.com/contour-detection-using-opencv-python-c/#Drawing-Contours-using-CHAIN_APPROX_NONE
Color feature extraction: the images are converted from BGR to the YCrCb format.
ycrcb_list = []
# Converting the images to YCrCb color space and appending them to the list
for image in image_list:
    ycrcb_list.append(cv.cvtColor(image, cv.COLOR_BGR2YCrCb))
The images are stored in a list in YCrCb format (OpenCV's defined channel ordering for this conversion). Each pixel takes the form [1, 2, 3], where:
1 = Y (Luma) - The brightness of the pixel
2 = Cr (Chrominance red) - The red difference
3 = Cb (Chrominance blue) - The blue difference
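OpenCV's 8-bit BGR-to-YCrCb conversion follows the ITU-R BT.601 formulas. A dependency-free sketch of those formulas (a simplified illustration, not OpenCV's exact fixed-point implementation):

```python
def bgr_to_ycrcb(b, g, r):
    """Approximate OpenCV's 8-bit BGR -> YCrCb conversion (ITU-R BT.601)."""
    def clip(v):
        # OpenCV rounds and saturates each channel to [0, 255]
        return max(0, min(255, round(v)))
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma (brightness)
    cr = (r - y) * 0.713 + 128              # red-difference chroma, offset by 128
    cb = (b - y) * 0.564 + 128              # blue-difference chroma, offset by 128
    return clip(y), clip(cr), clip(cb)

# A pure white pixel: maximum luma, neutral (128) chroma
print(bgr_to_ycrcb(255, 255, 255))  # -> (255, 128, 128)
```

The 128 offset centers the chroma channels, which is why a grayscale pixel always lands at Cr = Cb = 128.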
def pixels_in_contour(image, contour):
    # Initialize lists to store the Y, Cr, Cb values within the contour
    y_values = []
    cr_values = []
    cb_values = []
    # Iterate over each pixel in the image
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # Check if the pixel is strictly inside the contour (+1 inside, 0 on the edge, -1 outside)
            if cv.pointPolygonTest(contour, (x, y), measureDist=False) == 1:
                # Get the YCrCb values at the pixel location (NumPy indexing is [row, column], i.e. [y, x])
                y_value, cr_value, cb_value = image[y, x]
                # Append the Y, Cr, Cb values to the lists
                y_values.append(y_value)
                cr_values.append(cr_value)
                cb_values.append(cb_value)
    return y_values, cr_values, cb_values
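cv.pointPolygonTest decides whether a point lies inside a polygonal contour. The underlying idea can be sketched without OpenCV as a standard ray-casting test, counting how many polygon edges a horizontal ray from the point crosses (an illustration of the principle, not OpenCV's exact implementation, which also handles edge cases and signed distances):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test: odd crossing count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through (x, y)?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))   # -> True
print(point_in_polygon(15, 5, square))  # -> False
```

Looping over every pixel of every image makes this step the slowest part of the feature extraction; it is correct but scales with image area times contour size.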
all_ycrcb_values = {}
# Iterate over each image and its first contour, adding the Y, Cr, Cb values to the dictionary
for indx, image in enumerate(ycrcb_list):
    y_values, cr_values, cb_values = pixels_in_contour(image, contours_list[indx][0])
    all_ycrcb_values[indx] = {'y_values': y_values, 'cr_values': cr_values, 'cb_values': cb_values}
def comp_stats(component):
    # Calculate the mean, variance, skewness, and kurtosis
    mean = np.mean(component)
    var = np.var(component)
    skewness = skew(component)
    kurt = kurtosis(component)
    return mean, var, skewness, kurt
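comp_stats relies on scipy's skew and kurtosis, which by default use the central-moment (Fisher) definitions, with kurtosis reported as excess kurtosis (0 for a normal distribution). A dependency-free sketch of the same quantities:

```python
def moment_stats(data):
    """Mean, variance, skewness and excess kurtosis from central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # population variance
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5                     # Fisher-Pearson skewness
    kurt = m4 / m2 ** 2 - 3                       # excess kurtosis
    return mean, m2, skewness, kurt

# Symmetric data has skewness 0; this flat sample has negative excess kurtosis
print(moment_stats([1, 2, 3, 4, 5]))  # -> mean 3.0, variance 2.0, skewness 0.0, kurtosis ~ -1.3
```

Note that m2 is zero for a constant pixel component, which makes the skewness and kurtosis ratios undefined; this is the situation that produces the runtime warnings suppressed below.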
y_stats, cr_stats, cb_stats = [], [], []
# Add all Y, Cr, Cb stats to their respective lists
for i in range(500):
    y_stats.append(comp_stats(all_ycrcb_values[i]['y_values']))
    cr_stats.append(comp_stats(all_ycrcb_values[i]['cr_values']))
    cb_stats.append(comp_stats(all_ycrcb_values[i]['cb_values']))
# Suppress runtime warnings (e.g. from skew/kurtosis on constant data)
warnings.filterwarnings("ignore", category=RuntimeWarning)
Testing whether the point (x = 125, y = 160) lies within the contour of test_image.
test_contour = find_contours(test_image)[0]
if cv.pointPolygonTest(test_contour, (125, 160), measureDist=False) >= 0:
    print("Point (125, 160) is inside the contour")
else:
    print("Point (125, 160) is outside the contour")
Point (125, 160) is inside the contour
Mean values of the Y, Cr and Cb components for the test image (within the contour).
# Test image Y, Cb, Cr means
test_ycrcb = cv.cvtColor(test_image, cv.COLOR_BGR2YCrCb)
test_y_values, test_cr_values, test_cb_values = pixels_in_contour(test_ycrcb, test_contour)
test_mean_y, test_mean_cr, test_mean_cb = np.mean(test_y_values), np.mean(test_cr_values), np.mean(test_cb_values)
print("Test image Y mean: ", test_mean_y, "\nTest image Cr mean: ", test_mean_cr, "\nTest image Cb mean: ", test_mean_cb)
Test image Y mean:  223.90895232815964 
Test image Cr mean:  126.90562638580931 
Test image Cb mean:  133.640243902439
In the second part, the images are converted to YCrCb format, following OpenCV's conversion mode.
Each image is then scanned pixel by pixel to collect the component values inside its contour.
These values are used to calculate the mean, variance, skewness and kurtosis of each YCrCb component for the sample set and the test image.
Main source: https://docs.opencv.org/3.4/d8/d01/group__imgproc__color__conversions.html#ga397ae87e1288a81d2363b61574eb8cab
Dimension feature extraction.
# Function for ellipse fitting
def fit_ellipse(contour):
    # Fit an ellipse to the contour
    ellipse = cv.fitEllipse(contour)
    return ellipse
# Ellipses for all the images (the threshold value in find_contours had to be lowered, since not all images produced contours)
ellipse_list = []
for indx, image in enumerate(image_list):
    contours = contours_list[indx]
    cnt = contours[0]  # the first contour
    ellipse = cv.fitEllipse(cnt)
    ellipse_list.append(ellipse)
# The rice species in image_list are: 0-99 Arborio, 100-199 Basmati, 200-299 Ipsala, 300-399 Jasmine, 400-499 Karacadag
# Only the first image of each species is plotted with its fitted ellipse
fig, axs = plt.subplots(1, 5, figsize=(15, 4))
for idx, i in enumerate([0, 100, 200, 300, 400]):
    image_copy = image_list[i].copy()
    cv.ellipse(image_copy, ellipse_list[i], (0, 255, 0), 2)
    # Convert BGR to RGB so matplotlib displays the colors correctly
    axs[idx].imshow(cv.cvtColor(image_copy, cv.COLOR_BGR2RGB))
    # Species name fetched from the directory name in the subset list
    axs[idx].set_title(os.path.basename(os.path.dirname(subset[i])))
    axs[idx].axis('off')
# Dimension features
MajorAxisLength = []
MinorAxisLength = []
AreaContour = []
PerimeterContour = []
EquivalentDiameter = []
Compactness = []
Shape_Factor_1 = []
Shape_Factor_2 = []
# Calculating the features for all of the images
for indx, ellipse in enumerate(ellipse_list):
    L = max(ellipse[1]); MajorAxisLength.append(L)
    l = min(ellipse[1]); MinorAxisLength.append(l)
    A = cv.contourArea(contours_list[indx][0]); AreaContour.append(A)
    P = cv.arcLength(contours_list[indx][0], True); PerimeterContour.append(P)
    ED = np.sqrt((4*A) / np.pi); EquivalentDiameter.append(ED)
    Co = ED/L; Compactness.append(Co)
    SF1 = L/A; Shape_Factor_1.append(SF1)
    SF2 = l/A; Shape_Factor_2.append(SF2)
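As a sanity check on these formulas: for a perfect circle, the equivalent diameter sqrt(4A/π) equals the actual diameter, so compactness ED/L is exactly 1; elongated grains fall below 1. A small sketch using an idealized circle (π·r² in place of a measured contour area):

```python
import math

def dimension_features(area, major_axis, minor_axis):
    """Equivalent diameter, compactness and shape factors as defined above."""
    ed = math.sqrt(4 * area / math.pi)  # diameter of a circle with the same area
    return {
        'EquivalentDiameter': ed,
        'Compactness': ed / major_axis,
        'Shape_Factor_1': major_axis / area,
        'Shape_Factor_2': minor_axis / area,
    }

# Idealized circle of radius 10: both axes equal the diameter 20, area is pi*r^2
feats = dimension_features(math.pi * 10 ** 2, 20, 20)
print(round(feats['EquivalentDiameter'], 6))  # -> 20.0
print(round(feats['Compactness'], 6))         # -> 1.0
```

This is why, in the comparison further below, rounder grains show compactness values closer to 1 than slender ones.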
# Test image dimension features
test_ellipse = fit_ellipse(test_contour)
print("Test image features:")
print("Major axis length: ", max(test_ellipse[1]))
print("Minor axis length: ", min(test_ellipse[1]))
print("Contour area: ", cv.contourArea(test_contour))
print("Contour perimeter: ", cv.arcLength(test_contour, True))
test_ED = np.sqrt((4*cv.contourArea(test_contour)) / np.pi)  # Used later for comparison
print("Equivalent diameter: ", test_ED)
print("Compactness: ", np.sqrt((4*cv.contourArea(test_contour)/np.pi))/max(test_ellipse[1]))
print("Shape factor 1: ", max(test_ellipse[1])/cv.contourArea(test_contour))
print("Shape factor 2: ", min(test_ellipse[1])/cv.contourArea(test_contour))
Test image features:
Major axis length:  194.51629638671875
Minor axis length:  50.02014923095703
Contour area:  7409.5
Contour perimeter:  430.8355675935745
Equivalent diameter:  97.12913264
Compactness:  0.4993367365137288
Shape factor 1:  0.026252283742049902
Shape factor 2:  0.006750813041494977
The third part included ellipse fitting and plotting one grain of each species.
The ellipses were collected into a list, which was later used to calculate the dimension features.
The test image's dimension features are also calculated.
Main sources: the original article and
https://docs.opencv.org/3.4/d6/d6e/group__imgproc__draw.html#ga28b2267d35786f5f890ca167236cbc69
All features were gathered into a dataframe, with each row representing one sample and its associated feature values. Information about the original image and the rice species label were also included for each data point. This data was saved as a parquet file in the 'training_data' folder.
# For image label index
species_index = []
for i in range(5):
    for j in range(100):
        species_index.append(i + 1)
# Inserting the features into a dataframe
featuresDF = pd.DataFrame({
    'Label': [os.path.basename(os.path.dirname(species)) for species in subset],
    'Original_Image_Path': subset,
    'ImageIndex': species_index,
    'MajorAxisLength': MajorAxisLength,
    'MinorAxisLength': MinorAxisLength,
    'AreaContour': AreaContour,
    'PerimeterContour': PerimeterContour,
    'EquivalentDiameter': EquivalentDiameter,
    'Compactness': Compactness,
    'Shape_Factor_1': Shape_Factor_1,
    'Shape_Factor_2': Shape_Factor_2
})
# Adding the Y, Cr, Cb statistics to the dataframe
featuresDF['Y_mean'] = [y[0] for y in y_stats]
featuresDF['Y_var'] = [y[1] for y in y_stats]
featuresDF['Y_skew'] = [y[2] for y in y_stats]
featuresDF['Y_kurt'] = [y[3] for y in y_stats]
featuresDF['Cr_mean'] = [cr[0] for cr in cr_stats]
featuresDF['Cr_var'] = [cr[1] for cr in cr_stats]
featuresDF['Cr_skew'] = [cr[2] for cr in cr_stats]
featuresDF['Cr_kurt'] = [cr[3] for cr in cr_stats]
featuresDF['Cb_mean'] = [cb[0] for cb in cb_stats]
featuresDF['Cb_var'] = [cb[1] for cb in cb_stats]
featuresDF['Cb_skew'] = [cb[2] for cb in cb_stats]
featuresDF['Cb_kurt'] = [cb[3] for cb in cb_stats]
# Saving the dataframe to /training_data as a parquet file
featuresDF.to_parquet('../training_data/Features.parquet')
Maximum variance of the Cr component for each rice species.
featuresDF.groupby('Label').Cr_var.max()
Label
Arborio       1.473677
Basmati       7.572945
Ipsala        4.995914
Jasmine      14.479954
Karacadag     1.778424
Name: Cr_var, dtype: float64
Minimum equivalent diameter for each rice species
featuresDF.groupby('Label').EquivalentDiameter.min()
Label
Arborio       72.273544
Basmati       72.800135
Ipsala       101.422395
Jasmine       69.854782
Karacadag     72.158943
Name: EquivalentDiameter, dtype: float64
Minimum, maximum and median equivalent diameter for Basmati rice samples.
This is compared to the equivalent diameter value of the test_image.
# Basmati equivalent diameter statistics compared to the test image
featuresDF_basmati = featuresDF.iloc[100:200]
print(featuresDF_basmati.EquivalentDiameter.describe())
print('Test image equivalent diameter: ', test_ED)
count    100.000000
mean      85.217352
std        5.050823
min       72.800135
25%       81.896929
50%       85.825367
75%       89.033770
max       94.713300
Name: EquivalentDiameter, dtype: float64
Test image equivalent diameter:  97.12913264
# Plotting the min and max compactness images (converted BGR -> RGB for correct display)
plt.subplot(1, 2, 1)
plt.imshow(cv.cvtColor(image_list[featuresDF[featuresDF.Compactness == featuresDF.Compactness.max()].index[0]], cv.COLOR_BGR2RGB))
plt.subplot(1, 2, 2)
plt.imshow(cv.cvtColor(image_list[featuresDF[featuresDF.Compactness == featuresDF.Compactness.min()].index[0]], cv.COLOR_BGR2RGB))
To assess the effect of the compactness value on the shape of the rice grain, the images with the maximum and minimum compactness values are plotted by the code above.
The pictures show that the higher the compactness value, the rounder the grain.
The last part gathered all of the features into a dataframe (featuresDF).
For each species, these features were used to determine the maximum variance of the Cr component and the minimum equivalent diameter. The minimum, maximum and median equivalent diameter were also calculated for the Basmati samples.
Lastly, the effect of the compactness value was evaluated.
Main source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Import training data
df = pd.read_parquet('../training_data/Features.parquet')
Standardization using z-score
# Extract numeric data and standardize
scaler = StandardScaler()
num_data = df.drop(df.columns[[0, 1, 2]], axis=1)
num_data_z = pd.DataFrame(scaler.fit_transform(num_data), columns=num_data.columns)
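StandardScaler applies the z-score transform z = (x − μ)/σ per column, using the population standard deviation (ddof = 0). A minimal single-column sketch of what fit_transform computes:

```python
def z_score(values):
    """Standardize one column: subtract the mean, divide by the population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # ddof=0, as in StandardScaler
    return [(v - mean) / std for v in values]

print([round(z, 6) for z in z_score([2.0, 4.0, 6.0])])  # -> [-1.224745, 0.0, 1.224745]
```

After standardization every feature column has mean 0 and unit variance, so features measured on very different scales (areas vs. chroma moments) contribute comparably to PCA and to the classifiers.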
Histograms
# Labels
labels = num_data_z.columns
# Species
species = df['Label'].unique()
# Colors
colors = ['blue', 'green', 'red', 'cyan', 'magenta']
num_rows = 5
num_cols = 4
# Create the figure and axes
fig, axes = plt.subplots(num_rows, num_cols, figsize=(30, 30))
# Flatten the axes to iterate over them
axes = axes.flatten()
# Iterate over each feature and plot it
for i, feature in enumerate(num_data_z.columns):
    for j in range(5):  # Iterate over each segment of the data
        ax = axes[i]
        ax.hist(num_data_z.iloc[j * 100: (j + 1) * 100][feature], color=colors[j], label=species[j], edgecolor='black', bins=15)
        ax.set_title(feature)
        ax.set_xlabel('Value')
        ax.set_ylabel('Count')
        ax.legend()
plt.tight_layout()
plt.show()
# Convert NaN values to 0
num_data_z.fillna(0, inplace=True)
Pairplots
# Add image index to copy for hue
num_data_z_ind = num_data_z.copy()
num_data_z_ind['ImageIndex'] = df['ImageIndex']
# Pairplot
sns.pairplot(num_data_z_ind, hue='ImageIndex')